Machine Learning Analysis Pipeline
EDR: Dataset Loading & Preprocessing
EDR – Train/Test Overview
• Train shape: (17536, 20) | Test shape: (1535, 20)
• Total train samples: 17,536 | Total test samples: 1,535
• Number of features: 18
• Target column: 'label'
• Missing values (train): 0 | (test): 0
EDR – Train Class Distribution
• 0: 16,679
• 1: 857
• Class balance (minority/majority): 5.1382%
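The class-balance figure can be reproduced from the two train counts. The sketch below is only arithmetic on the numbers listed above; the extra "minority share" line is an added illustration and is not a value the report prints.

```python
# Train class counts as listed in the report.
counts = {0: 16_679, 1: 857}

minority = min(counts.values())
majority = max(counts.values())
total = sum(counts.values())

ratio = minority / majority   # minority/majority, the definition the report uses
share = minority / total      # minority share of all train samples (added for context)

print(f"minority/majority: {ratio:.4%}")   # 5.1382%
print(f"minority share:    {share:.4%}")   # 4.8871%
```

Roughly one sample in twenty is positive, so raw accuracy is expected to be dominated by class 0.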
EDR – Feature Preparation
• Target encoding: {0: 0, 1: 1}
• Data preprocessing: Infinite values handled, missing values filled with train medians
• Feature scaling: StandardScaler (fit on train, applied to test)
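The preparation steps above can be sketched for a single feature column as follows. This is a standard-library illustration, not the pipeline's actual code: the report only says infinite values are "handled", so treating them as missing is an assumption, and `fit_preprocessor` / `transform` are hypothetical helper names. The scaling step mirrors what `StandardScaler` computes (mean and population standard deviation learned from train).

```python
import math
from statistics import fmean, median, pstdev

def fit_preprocessor(train_col):
    """Learn median, mean and std for one feature from TRAIN data only."""
    finite = [x for x in train_col if math.isfinite(x)]   # drops inf and NaN (assumption)
    med = median(finite)
    filled = [x if math.isfinite(x) else med for x in train_col]
    mean = fmean(filled)
    std = pstdev(filled) or 1.0   # population std (ddof=0); guard constant columns
    return med, mean, std

def transform(col, params):
    """Apply train-derived statistics to any split (train or test)."""
    med, mean, std = params
    return [((x if math.isfinite(x) else med) - mean) / std for x in col]

# Toy values, purely illustrative.
train = [1.0, 2.0, float("inf"), 4.0, float("nan")]
test = [float("nan"), 3.0]

params = fit_preprocessor(train)          # statistics come from train only
train_scaled = transform(train, params)
test_scaled = transform(test, params)     # test never influences the statistics
print(params, test_scaled)
```

Fitting on train and reusing the same parameters on test is what keeps test-set information from leaking into the model inputs.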
Baseline (Most-Frequent Class) Accuracy, test set: 0.9518
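The baseline value is consistent with always predicting the train majority class and scoring on the test split. A minimal check, assuming the test class counts equal the `support` values (1461 and 74) shown in the classification reports:

```python
train_counts = {0: 16_679, 1: 857}   # train class distribution from the report
test_counts = {0: 1_461, 1: 74}      # `support` column of the classification reports

# The most-frequent strategy learns the majority class from train...
majority_class = max(train_counts, key=train_counts.get)

# ...and is then scored on each split.
baseline_test_acc = test_counts[majority_class] / sum(test_counts.values())
baseline_train_acc = train_counts[majority_class] / sum(train_counts.values())

print(f"test:  {baseline_test_acc:.4f}")    # 0.9518
print(f"train: {baseline_train_acc:.4f}")   # 0.9511
```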
EDR: Model Performance Comparison
EDR – Model Performance Metrics
| Model | Accuracy | Balanced Acc | Precision | Recall | F1 | ROC-AUC | PR-AUC |
|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.9303 | 0.5849 | 0.2381 | 0.2027 | 0.2190 | 0.6400 | 0.1425 |
| Random Forest (SMOTE) | 0.9414 | 0.5715 | 0.3000 | 0.1622 | 0.2105 | 0.7280 | 0.1860 |
| LightGBM | 0.9440 | 0.5536 | 0.3000 | 0.1216 | 0.1731 | 0.8489 | 0.2193 |
| Balanced RF | 0.8925 | 0.6677 | 0.2026 | 0.4189 | 0.2731 | 0.8290 | 0.1893 |
| SGD SVM | 0.0860 | 0.5198 | 0.0501 | 1.0000 | 0.0954 | nan | nan |
| IsolationForest | 0.8397 | 0.5502 | 0.0825 | 0.2297 | 0.1214 | nan | nan |
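Read against the most-frequent baseline of 0.9518, none of the accuracies in this table is an improvement, which is why the imbalance-aware columns carry more information. The sketch below copies two columns from the table and compares them with the baseline; the 0.5 reference for balanced accuracy is what an always-majority predictor scores (recall 1 on class 0, recall 0 on class 1).

```python
BASELINE_ACC = 0.9518   # most-frequent baseline reported above

# (accuracy, balanced accuracy) copied from the EDR metrics table.
edr = {
    "Logistic Regression":   (0.9303, 0.5849),
    "Random Forest (SMOTE)": (0.9414, 0.5715),
    "LightGBM":              (0.9440, 0.5536),
    "Balanced RF":           (0.8925, 0.6677),
    "SGD SVM":               (0.0860, 0.5198),
    "IsolationForest":       (0.8397, 0.5502),
}

for model, (acc, bal_acc) in edr.items():
    print(f"{model:<22} acc - baseline = {acc - BASELINE_ACC:+.4f}   "
          f"bal_acc - 0.5 = {bal_acc - 0.5:+.4f}")

beats_baseline = [m for m, (acc, _) in edr.items() if acc > BASELINE_ACC]
above_chance = [m for m, (_, bal) in edr.items() if bal > 0.5]
print(beats_baseline)       # []
print(len(above_chance))    # 6
```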
Confusion Matrix Analysis
| Model | TN | FP | FN | TP | FP Rate | Miss Rate |
|---|---|---|---|---|---|---|
| Logistic Regression | 1413 | 48 | 59 | 15 | 3.29% | 79.73% |
| Random Forest (SMOTE) | 1433 | 28 | 62 | 12 | 1.92% | 83.78% |
| LightGBM | 1440 | 21 | 65 | 9 | 1.44% | 87.84% |
| Balanced RF | 1339 | 122 | 43 | 31 | 8.35% | 58.11% |
| SGD SVM | 58 | 1403 | 0 | 74 | 96.03% | 0.00% |
| IsolationForest | 1272 | 189 | 57 | 17 | 12.94% | 77.03% |
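Every threshold-dependent metric in the performance table follows from these four counts. As a cross-check, the sketch below recomputes the Logistic Regression row; `derive_metrics` is a hypothetical helper name, and the formulas are the standard binary-classification definitions with class 1 as the positive class.

```python
def derive_metrics(tn, fp, fn, tp):
    """Standard binary metrics from a confusion matrix (positive class = 1)."""
    recall = tp / (tp + fn)                 # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "balanced_acc": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "fp_rate": fp / (fp + tn),          # 1 - specificity
        "miss_rate": fn / (fn + tp),        # 1 - recall
    }

# Logistic Regression row of the EDR confusion matrix table.
lr = derive_metrics(tn=1413, fp=48, fn=59, tp=15)
for name, value in lr.items():
    print(f"{name:<13} {value:.4f}")
```

The rounded values agree with the table: accuracy 0.9303, balanced accuracy 0.5849, precision 0.2381, recall 0.2027, F1 0.2190, FP rate 3.29% and miss rate 79.73%.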
Best Models by Metric
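| Metric | Best model | Value |
|---|---|---|
| Accuracy | LightGBM | 0.9440 |
| Balanced Acc | Balanced RF | 0.6677 |
| Precision | Random Forest (SMOTE), tied with LightGBM | 0.3000 |
| Recall | SGD SVM | 1.0000 |
| F1 | Balanced RF | 0.2731 |
| ROC-AUC | LightGBM | 0.8489 |
| PR-AUC | LightGBM | 0.2193 |
| Lowest False Positive Rate | LightGBM | 1.44% |
| Lowest Miss Rate | SGD SVM | 0.00% |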
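The per-metric winners can be recomputed from the EDR performance table. The sketch below is one way to do it, not the report's own selection logic: it skips `nan` entries and returns every model that reaches the maximum, so ties become visible. `best_by_metric` is a hypothetical helper name.

```python
import math

METRICS = ["Accuracy", "Balanced Acc", "Precision", "Recall", "F1", "ROC-AUC", "PR-AUC"]
nan = float("nan")

# Rows copied from the EDR model performance table.
ROWS = {
    "Logistic Regression":   [0.9303, 0.5849, 0.2381, 0.2027, 0.2190, 0.6400, 0.1425],
    "Random Forest (SMOTE)": [0.9414, 0.5715, 0.3000, 0.1622, 0.2105, 0.7280, 0.1860],
    "LightGBM":              [0.9440, 0.5536, 0.3000, 0.1216, 0.1731, 0.8489, 0.2193],
    "Balanced RF":           [0.8925, 0.6677, 0.2026, 0.4189, 0.2731, 0.8290, 0.1893],
    "SGD SVM":               [0.0860, 0.5198, 0.0501, 1.0000, 0.0954, nan, nan],
    "IsolationForest":       [0.8397, 0.5502, 0.0825, 0.2297, 0.1214, nan, nan],
}

def best_by_metric(rows, metrics):
    """Return, per metric, all models that reach the maximum (nan ignored)."""
    result = {}
    for i, metric in enumerate(metrics):
        scored = {m: v[i] for m, v in rows.items() if not math.isnan(v[i])}
        top = max(scored.values())
        result[metric] = [m for m, v in scored.items() if v == top]
    return result

best = best_by_metric(ROWS, METRICS)
for metric, models in best.items():
    print(f"{metric:<13} {', '.join(models)}")
```

On these numbers Precision is shared by Random Forest (SMOTE) and LightGBM at 0.3000; a selector that keeps only the first maximum reports just the former.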
EDR – Metrics by Model
EDR – ROC Curves
EDR – Precision–Recall Curves
EDR – Predicted Probability Distributions
EDR – Threshold Sweep
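For readers unfamiliar with the threshold sweep view, the idea is to re-score the same predicted probabilities at several decision thresholds and watch precision and recall trade off. The sketch below uses made-up labels and scores purely for illustration (they are not taken from this report), and `sweep` is a hypothetical helper name.

```python
def sweep(y_true, scores, thresholds):
    """Precision and recall of the rule `score >= t` for each threshold t."""
    rows = []
    for t in thresholds:
        pred = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for p, y in zip(pred, y_true) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(pred, y_true) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(pred, y_true) if p == 0 and y == 1)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        rows.append((t, precision, recall))
    return rows

# Hypothetical toy data: 3 positives among 10 samples.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.70, 0.90]

results = sweep(y_true, scores, thresholds=[0.3, 0.5, 0.8])
for t, precision, recall in results:
    print(f"t={t:.1f}  precision={precision:.3f}  recall={recall:.3f}")
```

Lowering the threshold raises recall at the cost of precision, which is the same trade-off the confusion matrix table shows across models.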
EDR: Logistic Regression – Detailed Analysis
EDR – Logistic Regression: Confusion Matrix
EDR – Logistic Regression: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9599 | 0.9671 | 0.9635 | 1461 |
| 1 | 0.2381 | 0.2027 | 0.2190 | 74 |
| accuracy | | | 0.9303 | 1535 |
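The class 0 row is the same confusion matrix read with class 0 as the "positive" class. A short sketch, using the Logistic Regression counts from the EDR confusion matrix table (`class_rows` is a hypothetical helper name):

```python
def class_rows(tn, fp, fn, tp):
    """Per-class precision, recall, F1 and support from one binary confusion matrix."""
    def row(hit, false_alarm, missed):
        precision = hit / (hit + false_alarm)
        recall = hit / (hit + missed)
        f1 = 2 * hit / (2 * hit + false_alarm + missed)
        return precision, recall, f1, hit + missed   # support = actual members of the class
    return {
        0: row(hit=tn, false_alarm=fn, missed=fp),   # class 0 treated as positive
        1: row(hit=tp, false_alarm=fp, missed=fn),   # class 1 treated as positive
    }

report = class_rows(tn=1413, fp=48, fn=59, tp=15)
for label, (precision, recall, f1, support) in report.items():
    print(f"{label}  {precision:.4f}  {recall:.4f}  {f1:.4f}  {support}")
```

Class 0 scores look strong (F1 0.9635) simply because it holds 1,461 of the 1,535 test samples; the class 1 row is the one that reflects detection quality.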
EDR – Logistic Regression: Feature Importance
EDR: Random Forest (SMOTE) – Detailed Analysis
EDR – Random Forest (SMOTE): Confusion Matrix
EDR – Random Forest (SMOTE): Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9585 | 0.9808 | 0.9696 | 1461 |
| 1 | 0.3000 | 0.1622 | 0.2105 | 74 |
| accuracy | | | 0.9414 | 1535 |
EDR – Random Forest (SMOTE): Feature Importance
EDR: LightGBM – Detailed Analysis
EDR – LightGBM: Confusion Matrix
EDR – LightGBM: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9568 | 0.9856 | 0.9710 | 1461 |
| 1 | 0.3000 | 0.1216 | 0.1731 | 74 |
| accuracy | | | 0.9440 | 1535 |
EDR – LightGBM: Feature Importance
EDR: Balanced RF – Detailed Analysis
EDR – Balanced RF: Confusion Matrix
EDR – Balanced RF: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9689 | 0.9165 | 0.9420 | 1461 |
| 1 | 0.2026 | 0.4189 | 0.2731 | 74 |
| accuracy | | | 0.8925 | 1535 |
EDR – Balanced RF: Feature Importance
EDR: SGD SVM – Detailed Analysis
EDR – SGD SVM: Confusion Matrix
EDR – SGD SVM: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 1.0000 | 0.0397 | 0.0764 | 1461 |
| 1 | 0.0501 | 1.0000 | 0.0954 | 74 |
| accuracy | | | 0.0860 | 1535 |
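These numbers sit very close to what a degenerate "flag everything" rule would score on the same test split, which suggests the perfect recall here is not a useful result at this operating point. The sketch below computes that reference point from the test supports (1461 negatives, 74 positives); it is a comparison aid, not a claim about how the SGD SVM was configured.

```python
n_neg, n_pos = 1461, 74   # test supports from the classification report

# An always-positive classifier: every sample is flagged as class 1.
tp, fp, tn, fn = n_pos, n_neg, 0, 0

accuracy = (tp + tn) / (n_pos + n_neg)
precision = tp / (tp + fp)            # equals the positive prevalence
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.0482 precision=0.0482 recall=1.0000 f1=0.0920
```

The reported SGD SVM row (accuracy 0.0860, precision 0.0501, recall 1.0000, F1 0.0954) improves on this reference only marginally.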
EDR – SGD SVM: Feature Importance
EDR: IsolationForest – Detailed Analysis
EDR – IsolationForest: Confusion Matrix
EDR – IsolationForest: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9571 | 0.8706 | 0.9118 | 1461 |
| 1 | 0.0825 | 0.2297 | 0.1214 | 74 |
| accuracy | | | 0.8397 | 1535 |
EDR – IsolationForest: Feature Importance
Feature importance not available for this model type.
XDR: Dataset Loading & Preprocessing
XDR – Train/Test Overview
• Train shape: (17536, 34) | Test shape: (1535, 34)
• Total train samples: 17,536 | Total test samples: 1,535
• Number of features: 32
• Target column: 'label'
• Missing values (train): 0 | (test): 0
XDR – Train Class Distribution
• 0: 16,679
• 1: 857
• Class balance (minority/majority): 5.1382%
XDR – Feature Preparation
• Target encoding: {0: 0, 1: 1}
• Data preprocessing: Infinite values handled, missing values filled with train medians
• Feature scaling: StandardScaler (fit on train, applied to test)
Baseline (Most-Frequent Class) Accuracy, test set: 0.9518
XDR: Model Performance Comparison
XDR – Model Performance Metrics
| Model | Accuracy | Balanced Acc | Precision | Recall | F1 | ROC-AUC | PR-AUC |
|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.7420 | 0.6271 | 0.0934 | 0.5000 | 0.1574 | 0.6204 | 0.1363 |
| Random Forest (SMOTE) | 0.9414 | 0.5587 | 0.2778 | 0.1351 | 0.1818 | 0.6934 | 0.1801 |
| LightGBM | 0.9492 | 0.5628 | 0.4167 | 0.1351 | 0.2041 | 0.8370 | 0.2122 |
| Balanced RF | 0.9036 | 0.6928 | 0.2394 | 0.4595 | 0.3148 | 0.8368 | 0.1678 |
| SGD SVM | 0.1173 | 0.5299 | 0.0512 | 0.9865 | 0.0973 | nan | nan |
| IsolationForest | 0.9199 | 0.5474 | 0.1449 | 0.1351 | 0.1399 | nan | nan |
Confusion Matrix Analysis
| Model | TN | FP | FN | TP | FP Rate | Miss Rate |
|---|---|---|---|---|---|---|
| Logistic Regression | 1102 | 359 | 37 | 37 | 24.57% | 50.00% |
| Random Forest (SMOTE) | 1435 | 26 | 64 | 10 | 1.78% | 86.49% |
| LightGBM | 1447 | 14 | 64 | 10 | 0.96% | 86.49% |
| Balanced RF | 1353 | 108 | 40 | 34 | 7.39% | 54.05% |
| SGD SVM | 107 | 1354 | 1 | 73 | 92.68% | 1.35% |
| IsolationForest | 1402 | 59 | 64 | 10 | 4.04% | 86.49% |
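Because the EDR and XDR runs share the same 1,535 test samples, the two confusion matrix tables can be compared count by count. The sketch below copies (FP, TP) from both tables and lists, per model, how many true and false positives the XDR feature set adds or removes; it is a reading aid on the reported counts, not an additional experiment.

```python
# (FP, TP) per model, copied from the EDR and XDR confusion matrix tables.
edr = {
    "Logistic Regression":   (48, 15),
    "Random Forest (SMOTE)": (28, 12),
    "LightGBM":              (21, 9),
    "Balanced RF":           (122, 31),
    "SGD SVM":               (1403, 74),
    "IsolationForest":       (189, 17),
}
xdr = {
    "Logistic Regression":   (359, 37),
    "Random Forest (SMOTE)": (26, 10),
    "LightGBM":              (14, 10),
    "Balanced RF":           (108, 34),
    "SGD SVM":               (1354, 73),
    "IsolationForest":       (59, 10),
}

# Change when moving from EDR to XDR features: (delta TP, delta FP).
delta = {m: (xdr[m][1] - edr[m][1], xdr[m][0] - edr[m][0]) for m in edr}
for model, (d_tp, d_fp) in delta.items():
    print(f"{model:<22} TP {d_tp:+d}   FP {d_fp:+d}")

# Models that catch more positives AND raise fewer false alarms with XDR.
improved_both = [m for m, (d_tp, d_fp) in delta.items() if d_tp > 0 and d_fp < 0]
print(improved_both)   # ['LightGBM', 'Balanced RF']
```

Only LightGBM and Balanced RF improve on both counts; Logistic Regression gains 22 true positives but at the cost of 311 additional false positives.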
Best Models by Metric
| Metric | Best model | Value |
|---|---|---|
| Accuracy | LightGBM | 0.9492 |
| Balanced Acc | Balanced RF | 0.6928 |
| Precision | LightGBM | 0.4167 |
| Recall | SGD SVM | 0.9865 |
| F1 | Balanced RF | 0.3148 |
| ROC-AUC | LightGBM | 0.8370 |
| PR-AUC | LightGBM | 0.2122 |
| Lowest False Positive Rate | LightGBM | 0.96% |
| Lowest Miss Rate | SGD SVM | 1.35% |
XDR – Metrics by Model
XDR – ROC Curves
XDR – Precision–Recall Curves
XDR – Predicted Probability Distributions
XDR – Threshold Sweep
XDR: Logistic Regression – Detailed Analysis
XDR – Logistic Regression: Confusion Matrix
XDR – Logistic Regression: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9675 | 0.7543 | 0.8477 | 1461 |
| 1 | 0.0934 | 0.5000 | 0.1574 | 74 |
| accuracy | | | 0.7420 | 1535 |
XDR – Logistic Regression: Feature Importance
XDR: Random Forest (SMOTE) – Detailed Analysis
XDR – Random Forest (SMOTE): Confusion Matrix
XDR – Random Forest (SMOTE): Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9573 | 0.9822 | 0.9696 | 1461 |
| 1 | 0.2778 | 0.1351 | 0.1818 | 74 |
| accuracy | | | 0.9414 | 1535 |
XDR – Random Forest (SMOTE): Feature Importance
XDR: LightGBM – Detailed Analysis
XDR – LightGBM: Confusion Matrix
XDR – LightGBM: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9576 | 0.9904 | 0.9738 | 1461 |
| 1 | 0.4167 | 0.1351 | 0.2041 | 74 |
| accuracy | | | 0.9492 | 1535 |
XDR – LightGBM: Feature Importance
XDR: Balanced RF – Detailed Analysis
XDR – Balanced RF: Confusion Matrix
XDR – Balanced RF: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9713 | 0.9261 | 0.9481 | 1461 |
| 1 | 0.2394 | 0.4595 | 0.3148 | 74 |
| accuracy | | | 0.9036 | 1535 |
XDR – Balanced RF: Feature Importance
XDR: SGD SVM – Detailed Analysis
XDR – SGD SVM: Confusion Matrix
XDR – SGD SVM: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9907 | 0.0732 | 0.1364 | 1461 |
| 1 | 0.0512 | 0.9865 | 0.0973 | 74 |
| accuracy | | | 0.1173 | 1535 |
XDR – SGD SVM: Feature Importance
XDR: IsolationForest – Detailed Analysis
XDR – IsolationForest: Confusion Matrix
XDR – IsolationForest: Classification Report
| Class | precision | recall | f1 | support |
|---|---|---|---|---|
| 0 | 0.9563 | 0.9596 | 0.9580 | 1461 |
| 1 | 0.1449 | 0.1351 | 0.1399 | 74 |
| accuracy | | | 0.9199 | 1535 |
XDR – IsolationForest: Feature Importance
Feature importance not available for this model type.